Human alternative open reading frames (HAltORF) is a publicly available and searchable online database referencing 
putative products of out-of-frame alternative translation initiation (ATI) in human mRNAs. Out-of-frame ATI is a process 
by which a single mRNA encodes independent proteins, when distinct initiation codons located in different reading frames 
are recognized by a ribosome to initiate translation. This mechanism is widely used by viruses to increase the coding 
potential of small viral genomes. There is increasing evidence that out-of-frame ATI is also used in eukaryotes, including 
humans, and may contribute to the diversity of the human proteome. HAltORF is the first web-based searchable database 
that allows thorough investigation in the human transcriptome of out-of-frame alternative open reading frames with a 
start codon located in a strong Kozak context, which are thus more likely to be expressed. It is also the first large-scale 
study on the human transcriptome to successfully predict the expression of out-of-frame ATI protein products that were 
previously discovered experimentally. HAltORF will be a useful tool for the identification of human genes with multiple 
coding sequences, and will help to better define and understand the complexity of the human proteome. 

Database URL: http://haltorf.roucoulab.com/. 



Introduction 

Each eukaryotic mRNA encoding a protein is usually associated 
with only one open reading frame (herein called reference ORF) 
or coding sequence (CDS) delineated by a start codon (most of 
the time AUG) and a stop codon, required to initiate and end 
translation, respectively. This simplistic view is, however, 
being challenged by the existence of at least two mechanisms 
resulting in increased protein diversity. In-frame alternative 
translation initiation (ATI) at downstream AUG codons allows 
the production of truncated protein isoforms with new functions 
or localization and is a well-characterized mechanism in 
eukaryotes (1,2). Out-of-frame ATI at the start codon of 
alternative ORFs (AltORFs) in the two other reading frames is a 
second mechanism, producing proteins with an amino acid sequence 
completely different from the reference protein. The 
nomenclature for reading frames used hereafter is the following 
(3). The +1 reading frame is determined by the coding sequence 
of the reference ORF for each transcript (independently of the 
gene or transcript). Hence, the annotated reference ORF is 
defined as frame +1, and there are two possible frames for 
AltORFs: frame +2 and frame +3. 
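
As an illustration of this nomenclature, the frame of any ORF can be derived from the offset between its start codon and that of the reference ORF. The following Python sketch is not part of HAltORF; the function name `reading_frame` and the use of 1-based mRNA coordinates are assumptions made for the example.

```python
def reading_frame(ref_start: int, orf_start: int) -> int:
    """Return the frame (1, 2 or 3) of an ORF relative to the reference ORF.

    Both arguments are 1-based mRNA coordinates of the first nucleotide
    of a start codon. The reference ORF itself is, by definition, in
    frame +1 (offset 0).
    """
    # Python's % always yields a non-negative result for a positive modulus,
    # so the formula also holds for a start codon upstream of the reference.
    return (orf_start - ref_start) % 3 + 1
```

With the coordinates given for AltPrP in Figure 1 (reference ORF starting at nt 1, AltORF starting at nt 90), `reading_frame(1, 90)` returns 3, in agreement with the +3 frame reported for this AltORF.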

The presence of overlapping ORFs and the use of out-of-frame 
ATI are well described in viruses (4-6) and provide small viral 
genomes with an increased coding capacity. In addition, a 
database referencing putative alternative ORFs in many 
prokaryotic genomes already exists (7). The role of out-of-frame 
ATI in eukaryotes has been overlooked. Yet, there is some 
evidence that proteins derived from AltORFs can affect 
physiological as well as pathological aspects of gene function. 
This is the case for the alternative protein ALEX encoded in the 
GNAS gene (8,9). In addition, we recently discovered the 
endogenous expression in human of an alternative protein product 
termed AltPrP, whose ORF (+3 reading frame) partially overlaps 
with the prion protein CDS (Figure 1) (10). Four other examples 
exist in human (11-14), which correspond to peptides that are 
targeted by anti-tumor responses in several types of cancers, 
and may thus serve as biomarkers or therapeutic targets (15). 
Interestingly, all but one of these AltORFs are included within 
the reference ORF (11). This observation is critical since the 
expression of cDNAs composed solely of the CDS in experimental 
systems such as cultured cells may actually result in the 
expression of more than one protein (10). Consequently, 
co-expression of an alternative protein together with the 
reference protein in functional studies likely leads to 
unnoticed confounding results. A database listing all human 
mRNAs that contain AltORFs overlapping with the reference ORF is 
therefore important to identify potential genes with multiple 
CDS. 

To our knowledge, three bioinformatics genome-wide 
studies aiming at the identification of AltORFs in mammals 
have been performed previously (16-18). However, none of 
them provided an online searchable option with links to 
GenBank and NCBI databases for further investigation. In 
one study, criteria such as conservation among species and 
a minimum length of 500 bp for the predicted AltORFs 
were used and only 40 putatively expressed AltORFs were 
referenced (16). In a more recent study, 138 potential dual 
coding transcripts were identified in human (18). In another 
study, a filter of a minimal length of 150 bp was applied 
and 1793 AltORFs were found to be conserved among rat, 
mouse and human (17). When the 1793 human AltORFs 
were filtered for the presence of an optimal Kozak context 
around the initiator AUG codon, known to be extremely 
important for efficient initiation of translation (19), this 
number dropped to 217 putative AltORFs. One objective 
of these three studies was to predict high confidence can- 
didate AltORFs, and the highly stringent criteria used were 
extremely pertinent in this matter. However, they were un- 
successful in predicting the expression of two experimen- 
tally proven AltORFs, AltPrP and ALEX. For all these 
reasons, it is obvious that a less stringent and potentially 
more comprehensive large scale bioinformatics analysis of 
AltORFs in the human transcriptome and a publicly avail- 
able and searchable online database of predicted AltORFs 
are lacking. 

Human alternative open reading frames (HAltORF; 
http://haltorf.roucoulab.com/) is the first web-based searchable 
database that allows thorough investigation in the human 
transcriptome of AltORFs overlapping with annotated CDS, and 
putatively expressed by out-of-frame ATI. It is also the first 
large-scale study on the human transcriptome to successfully 
predict the expression of AltPrP and ALEX, two experimentally 
discovered out-of-frame ATI protein products. HAltORF will be a 
useful tool for the identification of genes containing multiple 
CDS in human, and will help to better define and understand the 
complexity of the human proteome. 

Figure 1. AltPrP, a typical example of AltORFs in the HAltORF 
database. All mRNAs produced from the PRNP gene have the 
same reference ORF (nt 1-762, gray box) which encodes the 
prion protein (PrPC) in the +1 reading frame. An AltORF 
(white box) is present in the +3 reading frame (nt 90-309). 
Similar to all AltORFs present in the database, the alternative 
prion protein (AltPrP) encoding AltORF is entirely included in 
the CDS of the reference protein, and encodes a protein 
longer than 24 amino acids (minimum size threshold). 
Additionally, its AUG codon is in a different reading frame 
than the reference protein, and is located in an optimal 
Kozak context (shown in bold; consensus: A/GNNAUGG). 

Database generation 

The HAltORF database was built using a pipeline of Perl scripts 
that populate a MySQL database. All GenBank human mRNA and 
protein entries (release 37) were downloaded from the NCBI 
website (http://www.ncbi.nlm.nih.gov/), and each mRNA was 
associated with its reference protein. For each mRNA, in silico 
translation of the full sequence was performed using the Transeq 
software (20), and subsequent comparison of the results with the 
amino acid sequence of the reference protein allowed us to map 
the coordinates of the translation start and stop sites of the 
reference ORF on its corresponding mRNA. The sequence 5' of the 
translation start site of the reference ORF was then deleted. 
This action set the reading frame associated with the reference 
ORF in each mRNA to +1. The remaining sequence was then 
translated again using the Transeq software. All translation 
results equal to or above 24 amino acids in length, regardless 
of the reading frame, were stored in the database along with the 
coordinates of their start and stop sites. The arbitrary 
threshold of 24 amino acids was selected to reduce the database 
to an acceptable size, since we (data not shown) and other 
groups (16,17) noticed that the number of predicted AltORFs 
increases as the size threshold decreases. Additionally, the 
validation of the expression of smaller peptides by standard 
techniques, such as SDS-PAGE and western blots, would be 
technically too challenging. Next, based on a simplified 
consensus Kozak sequence (A/GNN ATG G) known to be favorable for 
efficient translation initiation (19), we determined for each 
predicted ORF start site whether it was located in a strong 
(perfect fit to the consensus) or weak (any other sequence) 
Kozak context. The last step was to select, in the CDS of each 
mRNA, the putative AltORFs that are the most likely to be 
expressed. To do so, we filtered the database using the 
following criteria: (i) ORFs had to be in the +2 or +3 reading 
frames to be selected, thus storing AltORFs, which are currently 
absent from existing protein databases; (ii) the predicted 
AltORFs had to possess a strong Kozak context around their AUG 
codon, to increase the chance of efficient translation 
initiation; (iii) the stop site of the AltORFs had to be located 
prior to the stop site of the reference ORFs, thus removing ORFs 
that are not entirely contained within the CDS of the reference 
protein. More details on the construction of the database are 
available on the HAltORF website. A typical example of an AltORF 
found in this new database is shown in Figure 1. 
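
The selection steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Perl and Transeq pipeline used to build the database: the function names are hypothetical, the input is assumed to be the reference CDS written in the DNA alphabet (starting at the reference ATG and ending with its stop codon), each candidate is assumed to start at the first strong-Kozak ATG encountered while scanning a frame, and in-frame ATGs nested inside a reported candidate are not reported separately.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_LENGTH_AA = 24  # size threshold stated in the text


def has_strong_kozak(seq: str, atg: int) -> bool:
    """Check the simplified consensus (A/G)NNATGG around the ATG at 0-based index atg."""
    if atg < 3 or atg + 3 >= len(seq):
        return False  # not enough flanking sequence to evaluate the context
    return seq[atg - 3] in "AG" and seq[atg + 3] == "G"


def find_alt_orfs(cds: str, min_aa: int = MIN_LENGTH_AA):
    """Return candidate AltORFs as (frame, start, stop, length_aa) tuples.

    start and stop are 1-based positions of the first nucleotide of the
    start codon and of the stop codon, relative to the reference ATG.
    """
    cds = cds.upper()
    ref_stop = len(cds) - 3  # 0-based index of the reference stop codon
    hits = []
    for offset in (1, 2):  # offsets 1 and 2 correspond to frames +2 and +3
        pos = offset
        while pos + 3 <= len(cds):
            if cds[pos:pos + 3] == "ATG" and has_strong_kozak(cds, pos):
                end = pos + 3
                while end + 3 <= len(cds) and cds[end:end + 3] not in STOP_CODONS:
                    end += 3
                if end + 3 > len(cds):
                    break  # no in-frame stop codon before the end of the CDS
                length_aa = (end - pos) // 3
                # criterion (iii): AltORF stop located before the reference stop
                if length_aa >= min_aa and end < ref_stop:
                    hits.append((offset + 1, pos + 1, end + 1, length_aa))
                pos = end + 3  # resume scanning after this stop codon
            else:
                pos += 3
    return hits
```

Scanning only offsets 1 and 2 implements criterion (i), the call to `has_strong_kozak` implements criterion (ii), and the comparison with `ref_stop` implements criterion (iii). Lowering `min_aa` would be expected to increase the number of candidates, as noted in the text.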

Database content 

We identified 17 096 distinct predicted AltORFs in the CDS 
of 31 422 mRNAs (41.2% of total human mRNAs) transcribed 
from 8744 genes (42.5% of total human genes). A total of 
14 195 (83%) are located in the +2 reading frame and 2901 
(17%) are located in the +3 reading frame. 
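
The frame distribution reported above can be checked with simple arithmetic. The snippet below only reuses the counts given in the text; the variable names are illustrative.

```python
# Counts of predicted AltORFs per reading frame, as reported in the text
frame_counts = {"+2": 14195, "+3": 2901}

total = sum(frame_counts.values())
# Share of each frame, rounded to the nearest percent
shares = {frame: round(100 * count / total) for frame, count in frame_counts.items()}

print(total, shares)  # prints: 17096 {'+2': 83, '+3': 17}
```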

For each AltORF, the gene name and accession number 
of the mRNA in which it is encoded are provided. Other 
information can also be found, including the reference pro- 
tein produced from the corresponding mRNA, the coordin- 
ates of the start and stop codon of both the reference ORF 
and the alternative ORF in the mRNA, and the predicted 
length and amino acid sequence of the alternative protein. 

Web interface 

The HAltORF database (http://haltorf.roucoulab.com/) can be 
searched by gene name or symbol, by mRNA or protein GenBank 
accession number, and by protein sequence (with a minimum of 5 
amino acids). Detailed explanations on how to perform a search 
and how results are displayed are available on the website under 
the Documentation tab. The search results are summarized in a 
table containing information for each retrieved AltORF, 
including the gene symbol, mRNA and reference protein accession 
numbers, reading frames, the location of the reference and 
alternative ORFs on the mRNA sequence, and the alternative 
protein length (Figure 2). The nucleotide numbers indicating the 
location of each ORF are those of the first nucleotide of the 
start codon and the first nucleotide of the stop codon, 
respectively. If multiple transcript variants exist for a given 
gene, all variants containing an alternative ORF are listed. If 
a search by protein sequence is performed, the table includes a 
supplementary column displaying the part of the alternative 
protein sequence matching the query sequence. For each retrieved 
alternative ORF, a detailed result page is accessible through a 
link and provides the user with basic information concerning the 
reference mRNA and protein. Links to the NCBI website are also 
provided to help the user retrieve supplementary information on 
the gene, mRNA and reference protein associated with the AltORF. 
The detailed result page also contains an alignment section 
where the reference and alternative protein sequences are 
aligned on the reference mRNA sequence (Figure 2). The complete 
HAltORF database can be freely downloaded in Microsoft Excel or 
FASTA format under the download tab. The complete MySQL data 
dump is also available in this section, allowing developers to 
predict other AltORFs using different parameters, such as the 
length of AltORFs. 
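
Because both displayed coordinates refer to the first nucleotide of a codon, the length of the encoded protein follows directly from them. The helper below is a minimal sketch with a hypothetical name (`orf_length_aa`); it is not part of the HAltORF code base and assumes 1-based coordinates as displayed in the results table.

```python
def orf_length_aa(start: int, stop: int) -> int:
    """Number of amino acids encoded by an ORF.

    start: position of the first nucleotide of the start codon.
    stop:  position of the first nucleotide of the stop codon.
    """
    span = stop - start
    if span <= 0 or span % 3 != 0:
        raise ValueError("start and stop must be in the same frame, with stop after start")
    return span // 3  # the stop codon itself is not translated
```

For the DEFB104A example of Figure 2, the alternative ORF 109 - 190 corresponds to (190 - 109) / 3 = 27 amino acids, and the reference ORF 15 - 231 to 72 amino acids.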

Relevance and research avenues 

The number of predicted AltORFs present in HAltORF is much 
greater than in other studies (16-18). This can be explained by 
several factors. In particular, we used a lower cut-off for the 
size of AltORFs, and chose not to consider criteria such as 
conservation among species and specific codon usage. However, 
our approach imposes several restrictions of its own, including 
the requirement that AUG initiation codons be located in an 
optimal Kozak context. Expression from AUG codons in the absence 
of an optimal Kozak sequence or from non-traditional CUG sites 
(21,22) is also possible and may be included in further studies. 
Nevertheless, the reduced stringency of our approach 
resulted in the successful prediction of AltPrP and ALEX, 
two experimentally well-characterized out-of-frame ATI 
products. It is likely that at least one of the several func- 
tions previously attributed to the prion protein is actually 
catalyzed by AltPrP (10), and we expect that some paradox- 
ical experimental results regarding the function of other 
genes might be explained by multiple coding as well. This 
example highlights the fact that conservation of an alternative 
ORF during evolution is not necessary for it to be biologically 
relevant, since the initiation codon for AltPrP is present in 
higher order mammals but not in lower mammals, including rodents 
(10). In addition, the presence in HAltORF of ALEX, for which 
polymorphisms have been associated with inherited neurological 
problems and increased trauma-related bleeding tendency (9), 
indicates that HAltORF could be valuable for the identification 
of biologically important AltORFs in human genes with multiple 
CDS. 

[Figure 2: screenshot of the search results table (columns: gene symbol; mRNA accession number; reference protein accession number; reference reading frame; reference ORF start - stop (nucleotides); alternative reading frame; alternative ORF start - stop (nucleotides); alternative protein length (amino acids)) for DEFB104A (NM_080389; reference ORF 15 - 231, frame +1; alternative ORF 109 - 190), and of the detailed result page with the alignment of the reference mRNA, reference protein and alternative protein sequences.]

Figure 2. Snapshot of a typical search and associated results pages. (1) Search by gene (DEFB104A, which encodes the β-defensin 
104 protein). (2) The number of corresponding AltORFs is indicated, and details on each AltORF are summarized in a table. 
Although this is not the case for this particular example, note that for a single gene, all AltORFs present in each transcript 
variant would be listed. The reference ORF is by definition in the +1 frame, and the alternative ORF is in the +2 frame in this 
example. The nucleotide numbers indicating the location of the ORFs are those of the first nucleotide of the start codon and the first 
nucleotide of the stop codon, respectively. (3) A detailed result page is available for each AltORF through the 'View' link. (4) In 
the detailed result page, basic information on the gene and mRNA of origin as well as the associated reference protein is 
displayed along with links to GenBank for each of these items (not shown). An alignment of the reference (blue letters) and 
alternative (green letters) protein sequences on the reference mRNA sequence (black letters) is provided. 

Last but not least, the complete database may help mass 
spectrometry services to identify the large proportion of 
unknown peptides in their data sets that cannot currently be 
matched to any protein in existing databases. Altogether, 
HAltORF will help in the meticulous exploration of this 
potential alternative proteome, which has been largely 
overlooked to date. 